Dataset Overview

We have roughly 300k observations and 18 columns in the dataset.

Check whether there are null values

This is a cleaned dataset with no null values.
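A quick per-column null count confirms the claim. This is a minimal sketch on a small stand-in frame; the real notebook would load the heart-disease CSV into `df` instead.

```python
import pandas as pd

# Stand-in frame; the real notebook loads the heart-disease CSV here.
df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No"],
    "BMI": [23.1, 28.4, 31.0],
    "SleepTime": [7, 6, 8],
})

# Count missing values per column; an all-zero result confirms a clean dataset.
null_counts = df.isnull().sum()
print(null_counts)
```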

We do see some unusual values, such as a BMI of 94.85 or a sleep time of 24 hours. We may need to drop those observations in the data preprocessing phase.
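Filtering such rows can be done with a boolean mask. The cutoffs below (BMI under 60, sleep time under 20 hours) are hypothetical; the actual thresholds would be chosen during preprocessing.

```python
import pandas as pd

# Toy rows: one has an implausible BMI, another a 24-hour SleepTime.
df = pd.DataFrame({
    "BMI": [23.1, 94.85, 28.4],
    "SleepTime": [7, 6, 24],
})

# Hypothetical cutoffs -- keep only rows that pass both sanity checks.
mask = (df["BMI"] < 60) & (df["SleepTime"] < 20)
df_clean = df[mask]
print(len(df_clean))  # 1 row survives both filters
```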

EDA

As shown in the pie chart, the dataset is quite imbalanced: roughly 9 percent "Yes" and 91 percent "No". We need to balance the classes for better modeling results.

The BMI of people who have heart disease is higher, on average, than that of people who do not.

In general, people are more likely to develop heart disease as they get older.

Data Preprocessing

I notice that there are some duplicate rows in the dataset, so I drop them to prevent the model from overfitting.

The original dataset has 319,795 rows.

After dropping duplicates, 301,717 rows remain.
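Deduplication is a one-liner in pandas. A minimal sketch on toy data:

```python
import pandas as pd

# Duplicate rows inflate the training data and can make the model overfit repeats.
df = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes"],
    "Smoking": ["Yes", "Yes", "No"],
})

before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(before, "->", len(df))  # 3 -> 2
```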

Drop unnecessary columns such as Race.

Encode Categorical Variables

GenHealth

AgeCategory

Diabetic

HeartDisease

One-hot encoding
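The encoding steps above can be sketched as follows. The ordinal ordering for GenHealth and the exact column set are assumptions for illustration; nominal columns such as Sex go through one-hot encoding.

```python
import pandas as pd

df = pd.DataFrame({
    "GenHealth": ["Poor", "Good", "Excellent"],
    "HeartDisease": ["No", "Yes", "No"],
    "Sex": ["Male", "Female", "Male"],
})

# Ordinal map (assumed ordering) preserves the natural ranking of GenHealth.
gen_health_order = {"Poor": 0, "Fair": 1, "Good": 2, "Very good": 3, "Excellent": 4}
df["GenHealth"] = df["GenHealth"].map(gen_health_order)

# Binary target: Yes -> 1, No -> 0.
df["HeartDisease"] = (df["HeartDisease"] == "Yes").astype(int)

# One-hot encode nominal columns such as Sex; drop_first avoids redundancy.
df = pd.get_dummies(df, columns=["Sex"], drop_first=True)
print(df.columns.tolist())
```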

Check correlations between the variables

No pair of variables appears highly correlated, so we keep all of them.
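One way to check this programmatically is to scan the upper triangle of the absolute correlation matrix for values above a threshold. The 0.8 cutoff and the random stand-in data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Independent random columns stand in for the real features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["BMI", "SleepTime", "PhysicalHealth"])

# Absolute correlations, upper triangle only, flagged above a 0.8 threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = [c for c in upper.columns if (upper[c] > 0.8).any()]
print(high)  # an empty list means no highly correlated pair
```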

Model

Train/test split
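Because the target is so imbalanced, the split should be stratified so the rare "Yes" class keeps its proportion in both sets. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and imbalanced target standing in for the real data.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Stratify so the rare positive class keeps its share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```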

Oversampling

Since the dataset is extremely imbalanced, we need to balance it before training the model. Here I use the SMOTE method for oversampling.

Standardize
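Standardization should be fit on the training split only, with the same statistics reused on the test split to avoid leakage. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

# Fit on the training split only, then reuse the same mean/std on test data
# so no test-set information leaks into the scaler.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0))  # ~[0. 0.] after standardization
```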

Apply models

Model Assessment

In this case, recall is probably more important than precision or accuracy, since there is a high cost associated with a False Negative. Recall measures how many of the actual positives the model captures by labeling them as positive (True Positives), so maximizing it lets us detect the real cases and possibly save lives.
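The trade-off is easy to see on a small hypothetical example, where recall = TP / (TP + FN):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = has heart disease, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Recall = TP / (TP + FN): the share of actual positives the model catches.
print(recall_score(y_true, y_pred))     # 0.75 (3 of 4 true cases caught)
print(precision_score(y_true, y_pred))  # 0.75 (3 of 4 flagged cases are real)
```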

Undersampling

It is also worth trying undersampling techniques to balance the dataset.

Standardize

Oversampling and undersampling achieve similar test recall, so I choose undersampling here: it trains faster, and I do not want synthetic data to bias the predictions.

SVC

Logistic Regression

XGBoost

Tune the model for best recall
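The key idea is passing `scoring="recall"` to the search so it picks the parameters that miss the fewest positives. This sketch uses scikit-learn's `GridSearchCV` with logistic regression and a hypothetical `C` grid as a stand-in; the notebook's XGBoost parameters (e.g. depth, learning rate) would be tuned the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the resampled training set.
X, y = make_classification(n_samples=300, n_features=6,
                           weights=[0.8, 0.2], random_state=0)

# scoring="recall" makes the search prefer parameters that miss the fewest positives.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```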

Rerun the model with the best parameters

The most important features are age, general health, physical health, sex, diabetic status, stroke, and smoking.
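Tree ensembles expose this ranking through `feature_importances_`. This sketch uses a random forest on synthetic data as a stand-in for the tuned XGBoost model, with a hypothetical subset of the feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a hypothetical subset of the dataset's feature names.
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)
names = ["AgeCategory", "GenHealth", "PhysicalHealth", "Sex", "Stroke"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort to see which features drive the predictions.
ranked = sorted(zip(names, model.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```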

Conclusion

Our final model is the tuned XGBoost, due to its highest recall of 96%. That means we can capture 96% of the patients who actually have heart disease. The challenge now is the model's low precision: it struggles to distinguish positive cases from negative ones. Increasing both the amount and the diversity of the data is definitely needed.

Our model also suggests that forming good habits is beneficial for our health: giving up bad habits like smoking and doing more physical exercise can substantially reduce the risk of heart and vascular disease.